Abstract. Perfectly rational decision-makers maximize expected utility,
but crucially ignore the resource costs incurred when determining
optimal actions. Here we propose an axiomatic framework for bounded
rational decision-making based on a thermodynamic interpretation of
resource costs as information costs. We show that this axiomatic
framework enforces a unique conversion law between utility and information,
which can be characterized by a variational "free utility" principle akin
to thermodynamical free energy. This variational principle constitutes a
normative criterion that trades off utility and information costs, the
latter measured by the Kullback-Leibler deviation between a distribution
representing a desired policy and a reference distribution representing an
initial default policy. We show that bounded optimal control solutions
can be derived from this variational principle, which leads in general to
stochastic policies. Furthermore, we show that risk-sensitive and robust
(minimax) control schemes fall out naturally from this framework if the
environment is considered as an adversarial opponent. When resource
costs are ignored, the maximum expected utility principle is recovered.

1 Introduction

Rational decision-making is usually based on the principle of maximum expected
utility (MEU) [?]. According to MEU, a rational agent chooses its action $a$ so as
to maximize its expected utility $E[U|a] = \sum_s P(s|a) U(s)$ given the probability
$P(s|a)$ that action $a \in A$ will lead to outcome $s \in S$ and given that the
desirability of the outcome $s$ is measured by the utility $U(s) \in \mathbb{R}$. Thus,
expected utilities express betting preferences over lotteries with uncertain
outcomes. The optimal action $a^* \in A$ is defined as the one that maximizes the
expected utility, that is $a^* := \arg\max_a E[U|a]$. However, computing such optimal
actions is often very difficult in practice due to prohibitive resource costs that
are associated with the process of finding the optimal action. Such resource costs
are ignored by MEU.
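
As a minimal sketch of the MEU rule described above, the following Python snippet evaluates $E[U|a] = \sum_s P(s|a) U(s)$ for each action and picks the maximizer. The action names, outcome names and all numbers are hypothetical and serve only as an illustration; note that the search over actions is not charged any cost, which is exactly what MEU leaves out.

```python
# Hypothetical toy problem: two actions, three outcomes (illustrative numbers).
P = {  # P[a][s] = probability that action a leads to outcome s
    "safe": {"low": 0.1, "mid": 0.8, "high": 0.1},
    "risky": {"low": 0.5, "mid": 0.0, "high": 0.5},
}
U = {"low": 0.0, "mid": 1.0, "high": 3.0}  # utility U(s) of each outcome


def expected_utility(action):
    """E[U|a] = sum_s P(s|a) U(s)."""
    return sum(p * U[s] for s, p in P[action].items())


def meu_action():
    """a* = argmax_a E[U|a]; the cost of the search itself is ignored."""
    return max(P, key=expected_utility)
```

With these numbers the risky action has the higher expected utility, so it is the one MEU selects.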

In contrast, a bounded rational decision-maker has only limited resources
and cannot afford an unlimited search for the optimal action [11]. Therefore,
such decision-makers have to trade off the utility that an action achieves against
the resource cost of finding the action. Imagine, for example, you want to invest
2 Bounded Rationality

some of your savings and you start reading up on several options, asking your
local bank, etc. However, as a bounded agent in the real world you cannot extend
this search forever, as you will lose out in the meantime. Therefore, you have
to somehow trade off the time invested in this search against a satisfactory
return from some investment option.

In this paper we propose an axiomatic formalization of bounded rationality
that leads to such a trade-off based on a thermodynamic interpretation of
resource costs [3]. The intuition behind this interpretation is that ultimately any
real decision-maker has to be incarnated in a physical system, since any process
of information processing must always be accompanied by a pertinent physical
process [15]. Thermodynamics provides the tools to study these general physical
systems. In Section 2 we discuss the thermodynamical notion of resource costs in
information processing systems. In Section 3 we show how a set of simple choice
axioms leads to a variational principle that allows computing bounded optimal
policies in systems with resource costs. In Section 4 we apply this framework
and show how to derive bounded optimal solutions for decision-making under
resource costs in different environments. We also show how to obtain classic
maximum expected utility solutions in the limit of negligible resource costs.



2 Resource Costs 

In the following we conceive of information processing as changes in information
states, i.e. ultimately changes of probability distributions that are represented
in physical systems. Changing an information state therefore implies changes
in physical states, such as flipping gates in a transistor, changing voltage on a
microchip, or even changing location of a gas particle. Changing such states is
costly and requires thermodynamical work [?]. Imagine, for example, that we use
an ideal gas particle in a box with volume $V_i$ as an information processing system
to represent a uniform probability density over a random variable with
$p_i = \frac{1}{V_i}$. If we now want to update this probability to $p_f$, because we
gained information $-\log p = -\log \frac{V_f}{V_i} > 0$, we have to reduce the original
volume to $V_f = p V_i$. However, this decrease in volume requires the work
$W = -\int_{V_i}^{V_f} \frac{NkT}{V} \, dV = NkT \ln \frac{V_i}{V_f}$,
where $N$ is the number of gas molecules, $k$ is the Boltzmann constant, and $T$ is
temperature. Thus, in this simple example we can compute the relation between
the change in information state and the required work, that is $W = -\alpha \log p$,
with $\alpha = \frac{NkT}{\log e} > 0$ being the conversion factor between information
and energy. The conversion factor $\alpha$ depends on the underlying properties of the
physical system and determines how expensive it is to process information. In the
next two sections, we derive a general expression of information costs for physical
systems that represent bounded rational decision-makers. Since such
decision-makers need to trade off utility and information costs, we will first
investigate the relation between information and utility [?] and then show how
information costs appear as an additional term in the utility in physically
implemented decision-makers.
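
The gas example can be checked numerically. The sketch below assumes the isothermal work $W = NkT \ln(V_i/V_f)$ and reads the conversion factor as $\alpha = NkT / \log e$, with the logarithm taken in a freely chosen base; the function names and the chosen numbers (one particle, 300 K, halving the volume) are illustrative assumptions.

```python
import math

K_BOLTZMANN = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def compression_work(n_molecules, temperature, v_initial, v_final):
    """Isothermal work W = N k T ln(V_i / V_f) needed to shrink the volume."""
    return n_molecules * K_BOLTZMANN * temperature * math.log(v_initial / v_final)


def conversion_factor(n_molecules, temperature, log_base=2.0):
    """alpha = N k T / log(e), with the logarithm taken in base `log_base`."""
    return n_molecules * K_BOLTZMANN * temperature / math.log(math.e, log_base)


# Halving the volume (p = 1/2) of a single particle at 300 K gains one bit.
p = 0.5
work = compression_work(1, 300.0, 1.0, p * 1.0)
alpha = conversion_factor(1, 300.0)   # energy per bit for this system
info_bits = -math.log2(p)             # information gained, in bits
```

Under these assumptions `work` and `alpha * info_bits` agree, which is the relation $W = -\alpha \log p$ with the logarithm in base 2.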




3 Conversion between utility and information 

3.1 Choice axioms

Consider a decision-maker whose behavior is represented by a probability space
$(\Omega, \mathcal{F}, P)$ with sample set $\Omega$ and $\sigma$-algebra $\mathcal{F}$ of
measurable events between which the decision-maker can choose. We assume that the
decision-maker can choose freely any probability measure $P$ representing his choice
behavior. Thus, if $P(A) > P(B)$, then the propensity of choosing $A$ is higher than
that of choosing $B$. This difference in probability can be given a utilitarian
interpretation: $A$ is chosen with higher probability than $B$ because $A$ is more
desirable than $B$. The measure that quantifies such differences in desirability is
commonly called a utility function. If there is such a measure, then it is
reasonable to demand the following properties:

i. Utilities should be mappings from events into real numbers.
ii. Absolute values of utility are irrelevant, only relative differences in utility
should matter ("utility gains").
iii. Utility gains should be additive.
iv. A decision-maker should assign more probability mass to events with high
utility and less probability mass to events with low utility.
v. An adversarial agent should make the reverse assignment of probability mass.

These postulates are summarized in the following definition. 

Definition 1 (Axioms of Choice). Let $(\Omega, \mathcal{F}, P)$ be a probability space.
A set function $U : \mathcal{F} \to \mathbb{R}$ is a utility function for a decision-maker
with probability measure $P$ iff its utility gain function
$U(A|B) := U(A \cap B) - U(B)$ has the following three properties for all events
$A, B, C, D \in \mathcal{F}$:

A1. $\exists f, \; U(A|B) = f(P(A|B)) \in \mathbb{R}$, (real-valued)

A2. $U(A \cap B|C) = U(A|C) + U(B|A \cap C)$, (additive)

A3. $P(A|B) > P(C|D) \Leftrightarrow U(A|B) > U(C|D)$. (monotonic increasing)

If the decision-maker is an adversarial opponent, the inequality of A3 is reversed:

A4. $P(A|B) > P(C|D) \Leftrightarrow U(A|B) < U(C|D)$. (monotonic decreasing)

Furthermore, we use the abbreviation $U(A) := U(A|\Omega)$.

The following theorem shows that these three properties enforce a strict 
mapping between probabilities and utility gains. 

Theorem 1 (Utility Gain $\leftrightarrow$ Probability). If $f$ is such that
$U(A|B) = f(P(A|B))$ for any probability space $(\Omega, \mathcal{F}, P)$, then $f$ is of
the form

$$f(\cdot) = \alpha \log(\cdot),$$

where $\alpha$ is an arbitrary strictly positive constant in case of A3 or an arbitrary
strictly negative constant in case of A4.
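
The converse direction of Theorem 1 is easy to check by brute force: a gain function of the form $\alpha \log P(A|B)$ satisfies the additivity axiom A2 because of the chain rule for conditional probabilities. The sketch below does this on a hypothetical four-outcome space; the measure `P_ATOM`, the value of `ALPHA` and all function names are assumptions made for illustration, and the check does not replace the proof of uniqueness.

```python
import math
from itertools import combinations

ALPHA = 0.7  # hypothetical conversion factor (> 0, i.e. axiom A3)
P_ATOM = {"w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4}  # toy probability measure


def prob(event):
    """P(A) for an event given as a frozenset of outcomes."""
    return sum(P_ATOM[w] for w in event)


def cond_prob(a, b):
    """P(A|B) = P(A and B) / P(B)."""
    return prob(a & b) / prob(b)


def gain(a, b):
    """Utility gain U(A|B) = alpha * log P(A|B)."""
    return ALPHA * math.log(cond_prob(a, b))


def nonempty_events():
    """All non-empty subsets of the sample set."""
    atoms = list(P_ATOM)
    return [frozenset(c) for r in range(1, len(atoms) + 1)
            for c in combinations(atoms, r)]


def additivity_holds(tol=1e-12):
    """Check A2: U(A and B|C) = U(A|C) + U(B|A and C) on all usable triples."""
    events = nonempty_events()
    for a in events:
        for b in events:
            for c in events:
                if not (a & b & c):
                    continue  # log(0) is undefined, skip null events
                lhs = gain(a & b, c)
                rhs = gain(a, c) + gain(b, a & c)
                if abs(lhs - rhs) > tol:
                    return False
    return True
```

Calling `additivity_holds()` returns `True` for this measure, and `gain` is increasing in the conditional probability whenever `ALPHA` is positive.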




The proof is provided elsewhere [10,8]. If one is willing to accept Definition 1,
then one obtains the relation

$$U(A \cap B) - U(B) = \alpha \log P(A|B). \tag{1}$$

In this relation, $\alpha$ plays the role of a conversion factor between utilities and
information. A bounded rational decision-maker is characterized by $\alpha > 0$, whereas
an adversarial opponent can be described by $\alpha < 0$. Unless otherwise stated,
we will assume $\alpha > 0$ in the following. If a probability measure $P$ and a utility
function $U$ satisfy the relation (1), then we say that they are conjugate. Given
that this transformation between utility gains and probabilities is a bijection,
one can rewrite any probability $P(A|B)$ as a Gibbs measure:

$$P(A|B) = \frac{\sum_{\omega \in A \cap B} \exp\left(\frac{1}{\alpha} U(\omega)\right)}{\sum_{\omega \in B} \exp\left(\frac{1}{\alpha} U(\omega)\right)}, \tag{2}$$

where we have used the abbreviation $U(\omega) := U(\{\omega\})$. This transformation
implies that the probability measure $P$ is the Gibbs measure with temperature
$\alpha$ and energy levels $e(\omega) := -U(\{\omega\})$. As the conversion factor $\alpha$
approaches zero, the probability measure $P(\omega)$ approaches a delta function
$\delta_{\omega^*}(\omega)$ with $\omega^* = \arg\max_\omega U(\omega)$, or in case of several
maxima the uniform distribution over the maximal set
$\Omega_{\max} := \{\omega^* \in \Omega \mid \omega^* = \arg\max_\omega U(\omega)\}$. Similarly,
as $\alpha \to \infty$, $P(\omega) \to \frac{1}{|\Omega|}$, i.e. the uniform distribution over
the whole outcome set $\Omega$.
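
The two temperature limits of the Gibbs measure can be seen directly in a small numerical sketch. The function `gibbs` and the utility values below are illustrative assumptions; the maximum utility is subtracted before exponentiating only to keep the computation numerically stable, which does not change the resulting probabilities.

```python
import math


def gibbs(utilities, alpha):
    """P(w) = exp(U(w)/alpha) / sum_w' exp(U(w')/alpha), computed stably."""
    m = max(utilities)
    weights = [math.exp((u - m) / alpha) for u in utilities]
    z = sum(weights)
    return [w / z for w in weights]


U = [1.0, 2.0, 2.0, 0.5]       # hypothetical utilities with two maxima
cold = gibbs(U, alpha=1e-3)    # alpha -> 0: uniform over the maximal set
hot = gibbs(U, alpha=1e6)      # alpha -> infinity: nearly uniform over all outcomes
```

Here `cold` puts probability one half on each of the two maximizers, while `hot` is close to one quarter everywhere.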

3.2 Variational principle 

It is well known in statistical physics that the Gibbs measure satisfies a
variational problem in the free energy [3]. Since utilities correspond to negative
energies, we can formulate a free utility principle that is maximized by a
decision-maker that acts according to (2).

Theorem 2. Let $X$ be a random variable with values in $\mathcal{X}$. Let $P$ and $U$ be
a conjugate pair of probability measure and utility function over $\mathcal{X}$. Define
the free utility functional as

$$J(Pr; U) := \sum_{x \in \mathcal{X}} Pr(x) U(x) - \alpha \sum_{x \in \mathcal{X}} Pr(x) \log Pr(x),$$

where $Pr$ is an arbitrary probability measure over $\mathcal{X}$. Then,

$$J(Pr; U) \le J(P; U) = U(\Omega).$$

A proof can be found in [7]. The free utility is a combined measure of a system's
expected utility and its uncertainty. The variational principle implies that the
Gibbs measure $P$ maximizes the free utility for a given utility function $U$, as
$P = \arg\max_{Pr} J(Pr; U)$.
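
The variational principle can be probed numerically by comparing the free utility of the Gibbs measure with that of arbitrary competitors. In the sketch below the utilities, the value of `ALPHA`, the random competitors and all function names are illustrative assumptions. The maximum is compared with $\alpha \log \sum_x \exp(U(x)/\alpha)$, which is the value that relation (1) assigns to $U(\Omega)$ once $P$ is normalized.

```python
import math
import random


def gibbs(utilities, alpha):
    """Gibbs measure P(x) proportional to exp(U(x)/alpha)."""
    m = max(utilities)  # subtract the maximum for numerical stability
    weights = [math.exp((u - m) / alpha) for u in utilities]
    z = sum(weights)
    return [w / z for w in weights]


def free_utility(pr, utilities, alpha):
    """J(Pr;U) = sum_x Pr(x) U(x) - alpha * sum_x Pr(x) log Pr(x)."""
    expected = sum(p * u for p, u in zip(pr, utilities))
    neg_entropy = sum(p * math.log(p) for p in pr if p > 0.0)
    return expected - alpha * neg_entropy


def random_distribution(n, rng):
    """An arbitrary probability vector, used as a competitor Pr."""
    raw = [rng.random() + 1e-9 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


ALPHA = 0.8                    # hypothetical conversion factor
U = [0.3, -1.0, 2.0, 0.7]      # hypothetical utilities U(x)
P = gibbs(U, ALPHA)
J_max = free_utility(P, U, ALPHA)
log_partition = ALPHA * math.log(sum(math.exp(u / ALPHA) for u in U))

rng = random.Random(0)         # fixed seed so the check is reproducible
gaps = [J_max - free_utility(random_distribution(len(U), rng), U, ALPHA)
        for _ in range(1000)]
```

Every entry of `gaps` is non-negative, since the gap equals $\alpha$ times the KL divergence between the competitor and the Gibbs measure.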

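The variational principle of the free utility also allows measuring the cost of
transforming the state of a stochastic system required for information processing.
Consider an initial system having probability measure $P_i$ and utility function
$U_i$. This system satisfies the equation

$$J_i := \sum_{x \in \mathcal{X}} P_i(x) U_i(x) - \alpha \sum_{x \in \mathcal{X}} P_i(x) \log P_i(x).$$

If we add new constraints represented by the utility function $U_*$, then the
resulting utility function $U_f$ is given by the sum

$$U_f = U_i + U_*,$$

and the resulting probability measure $P_f$ maximizes

$$\begin{aligned}
J(Pr; U_f) &= \sum_{x \in \mathcal{X}} Pr(x) U_f(x) - \alpha \sum_{x \in \mathcal{X}} Pr(x) \log Pr(x) \\
&= \sum_{x \in \mathcal{X}} Pr(x) \left( U_i(x) + U_*(x) \right) - \alpha \sum_{x \in \mathcal{X}} Pr(x) \log Pr(x) \\
&= \sum_{x \in \mathcal{X}} Pr(x) U_*(x) - \alpha \sum_{x \in \mathcal{X}} Pr(x) \log \frac{Pr(x)}{P_i(x)} + J_i,
\end{aligned}$$

where the last step uses $U_i(x) = \alpha \log P_i(x) + J_i$, which holds because $P_i$
and $U_i$ are conjugate. Let $J_f := J(P_f; U_f)$. The difference in free utility is

$$J_f - J_i = \sum_{x \in \mathcal{X}} P_f(x) U_*(x) - \alpha \sum_{x \in \mathcal{X}} P_f(x) \log \frac{P_f(x)}{P_i(x)}. \tag{3}$$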

In physical systems with constant $\alpha$, this difference measures the amount of
work necessary to change the state of the system from state $i$ to state $f$. The
first term of the equation measures the expected utility difference $U_*(x)$, while
the second term measures the information cost of transforming the probability
distribution from state $i$ to state $f$. These two terms can be interpreted as
determinants of bounded rational decision-making in that they formalize a trade-off
between an expected utility $U_*$ (first term) and the information cost of
transforming $P_i$ into $P_f$ (second term). In this interpretation $P_i$ represents an
initial probability or policy, which includes the special case of the uniform
distribution where the decision-maker has initially no preferences. Deviations from
this initial probability incur an information cost measured by the KL divergence.
If this deviation is bounded by a non-zero value, we have a bounded rational agent.
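
The decomposition of the free utility difference into an expected utility term and a KL cost can be verified on a toy system. The sketch assumes that both the initial and the final system are Gibbs measures of their respective utilities; the utilities, the value of `ALPHA` and the helper names are hypothetical.

```python
import math


def gibbs(utilities, alpha):
    """Gibbs measure P(x) proportional to exp(U(x)/alpha)."""
    m = max(utilities)
    weights = [math.exp((u - m) / alpha) for u in utilities]
    z = sum(weights)
    return [w / z for w in weights]


def free_utility(pr, utilities, alpha):
    """J(Pr;U) = sum_x Pr(x) U(x) - alpha * sum_x Pr(x) log Pr(x)."""
    expected = sum(p * u for p, u in zip(pr, utilities))
    neg_entropy = sum(p * math.log(p) for p in pr if p > 0.0)
    return expected - alpha * neg_entropy


def kl_divergence(p, q):
    """KL(p || q) = sum_x p(x) log(p(x) / q(x))."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)


ALPHA = 0.5                                  # hypothetical conversion factor
U_i = [0.0, 1.0, -1.0]                       # initial utilities
U_star = [2.0, 0.0, 1.0]                     # newly added constraint utilities
U_f = [a + b for a, b in zip(U_i, U_star)]   # U_f = U_i + U_*

P_i, P_f = gibbs(U_i, ALPHA), gibbs(U_f, ALPHA)
J_i = free_utility(P_i, U_i, ALPHA)
J_f = free_utility(P_f, U_f, ALPHA)

expected_gain = sum(p * u for p, u in zip(P_f, U_star))  # first term
info_cost = ALPHA * kl_divergence(P_f, P_i)              # second term
```

In this example `J_f - J_i` coincides with `expected_gain - info_cost` up to rounding, and `info_cost` is non-negative.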

In thermodynamics there are two dominant formulations of the second law
that allow determining the equilibrium distribution: the first and maybe more
familiar formulation is the principle of maximum entropy, and the second
is the principle of minimum energy [3]. The corresponding variational problems
are typically formulated such that in the case of maximum entropy we hold
the mean energy fixed (i.e. in our case the expected utility), and in the case of
minimum energy (i.e. in our case maximum utility) we hold the entropy fixed.
Mathematically, the constraints of fixed entropy and fixed utility are added by
Lagrange multipliers. In our context, with respect to equation (3), this leads to
two different variational principles:

1. Control. The minimum energy principle translates into a bounded maximum
utility principle. Given an initial policy represented by the probability
measure $P_i$ and the constraint utilities $U_*$, we are looking for the final
system $P_f$ that optimizes the trade-off between utility and resource costs. That
is,

$$P_f = \arg\max_{Pr} \sum_{x \in \mathcal{X}} Pr(x) U_*(x) - \alpha \sum_{x \in \mathcal{X}} Pr(x) \log \frac{Pr(x)}{P_i(x)}. \tag{4}$$

The solution is given by

$$P_f(x) \propto P_i(x) \exp\left( \frac{1}{\alpha} U_*(x) \right).$$

In particular, at very low temperature $\alpha \approx 0$, (3) becomes

$$J_f - J_i \approx \sum_{x \in \mathcal{X}} P_f(x) U_*(x),$$

and hence resource costs are ignored in the choice of $P_f$, leading to
$P_f \approx \delta_{x^*}(x)$, where $x^* = \arg\max_x U_*(x)$. Similarly, at a high
temperature, the difference is

$$J_f - J_i \approx -\alpha \sum_{x \in \mathcal{X}} P_f(x) \log \frac{P_f(x)}{P_i(x)},$$

and hence only resource costs matter, leading to $P_f \approx P_i$.
2. Estimation. The maximum entropy principle translates into a minimum
relative entropy principle for estimation. Given a final probability measure
$P_f$ that represents the environment and the constraint utilities $U_*$, we are
looking for the initial system $P_i$ that satisfies

$$\begin{aligned}
P_i &= \arg\max_{Pr} \sum_{x \in \mathcal{X}} P_f(x) U_*(x) - \alpha \sum_{x \in \mathcal{X}} P_f(x) \log \frac{P_f(x)}{Pr(x)} \\
&= \arg\min_{Pr} \sum_{x \in \mathcal{X}} P_f(x) \log \frac{P_f(x)}{Pr(x)},
\end{aligned} \tag{5}$$

and thus we have recovered the minimum relative entropy principle for
estimation, having the solution

$$P_i = P_f.$$

The minimum relative entropy principle for estimation is well known in the
literature as it underlies Bayesian inference [5], but the same principle can also
be applied to problems of adaptive control [3]. In the following we focus on
applications of the first principle to bounded optimal control.
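
The control principle in item 1 above can be sketched in a few lines. The function `bounded_policy` implements the stated solution $P_f(x) \propto P_i(x) \exp(U_*(x)/\alpha)$, and `tradeoff` evaluates the objective of (4); both names, the prior and the utilities are hypothetical choices made for illustration. The three calls at the end probe an intermediate temperature and the two limiting regimes.

```python
import math


def bounded_policy(prior, utilities, alpha):
    """P_f(x) proportional to P_i(x) * exp(U_*(x) / alpha)."""
    m = max(utilities)  # shift for numerical stability at small alpha
    weights = [p * math.exp((u - m) / alpha) for p, u in zip(prior, utilities)]
    z = sum(weights)
    return [w / z for w in weights]


def tradeoff(pr, prior, utilities, alpha):
    """Objective of (4): expected utility minus alpha * KL(pr || prior)."""
    gain = sum(p * u for p, u in zip(pr, utilities))
    cost = sum(p * math.log(p / q) for p, q in zip(pr, prior) if p > 0.0)
    return gain - alpha * cost


P_i = [0.5, 0.3, 0.2]      # hypothetical initial policy
U_star = [0.0, 1.0, 3.0]   # hypothetical constraint utilities

balanced = bounded_policy(P_i, U_star, alpha=1.0)
greedy = bounded_policy(P_i, U_star, alpha=1e-3)   # resource costs negligible
frozen = bounded_policy(P_i, U_star, alpha=1e6)    # resource costs dominate
```

At low temperature the policy concentrates on the utility maximizer, at high temperature it stays at the initial policy, and in between it is stochastic and scores at least as well on the objective of (4) as the alternatives one might try.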

4 Applications 

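Consider a system that first emits an action symbol $x_1$ with probability $p_0(x_1)$
and then expects a subsequent input signal $x_2$ with probability $p_0(x_2|x_1)$. Now
we impose a utility on this decision-maker that is given by $U(x_1)$ for the first
symbol and $U(x_2|x_1)$ for the second symbol. How should this system adjust
its action probability $p(x_1)$ and expectation $p(x_2|x_1)$? Given the boundedness
constraints $c_1$ and $c_2$ on the relative entropies, the variational problem is given
by

$$\begin{aligned}
\max_{p(x_1), p(x_2|x_1)} \; & \sum_{x_1} p(x_1) U(x_1) - \alpha \left( \sum_{x_1} p(x_1) \log \frac{p(x_1)}{p_0(x_1)} - c_1 \right) \\
& + \sum_{x_1} \sum_{x_2} p(x_1) p(x_2|x_1) U(x_2|x_1) \\
& - \beta \left( \sum_{x_1} \sum_{x_2} p(x_1) p(x_2|x_1) \log \frac{p(x_2|x_1)}{p_0(x_2|x_1)} - c_2 \right)
\end{aligned}$$

with $\alpha$ and $\beta$ as Lagrange multipliers. We can rewrite this sum as a nested
expression and drop all constants:

$$\max_{p(x_1), p(x_2|x_1)} \sum_{x_1} p(x_1) \left[ U(x_1) - \alpha \log \frac{p(x_1)}{p_0(x_1)} + \sum_{x_2} p(x_2|x_1) \left[ U(x_2|x_1) - \beta \log \frac{p(x_2|x_1)}{p_0(x_2|x_1)} \right] \right].$$

We have then an inner variational problem:

$$\max_{p(x_2|x_1)} \sum_{x_2} p(x_2|x_1) \left[ -\beta \log \frac{p(x_2|x_1)}{p_0(x_2|x_1)} + U(x_2|x_1) \right] \tag{6}$$

with the solution

$$p(x_2|x_1) = \frac{1}{Z_2} p_0(x_2|x_1) \exp\left( \frac{1}{\beta} U(x_2|x_1) \right) \tag{7}$$

and the $x_1$-dependent normalization constant

$$Z_2 = \sum_{x_2} p_0(x_2|x_1) \exp\left( \frac{1}{\beta} U(x_2|x_1) \right),$$

and an outer variational problem

$$\max_{p(x_1)} \sum_{x_1} p(x_1) \left[ -\alpha \log \frac{p(x_1)}{p_0(x_1)} + U(x_1) + \beta \log Z_2 \right] \tag{8}$$

with the solution

$$\begin{aligned}
p(x_1) &= \frac{1}{Z_1} p_0(x_1) \exp\left( \frac{1}{\alpha} \left( U(x_1) + \beta \log Z_2 \right) \right) \\
&= \frac{1}{Z_1} p_0(x_1) \exp\left( \frac{1}{\alpha} \left( U(x_1) + \beta \log \sum_{x_2} p_0(x_2|x_1) \exp\left( \frac{1}{\beta} U(x_2|x_1) \right) \right) \right)
\end{aligned} \tag{9}$$

and the normalization constant

$$\begin{aligned}
Z_1 &= \sum_{x_1} p_0(x_1) \exp\left( \frac{1}{\alpha} \left( U(x_1) + \beta \log Z_2 \right) \right) \\
&= \sum_{x_1} p_0(x_1) \exp\left( \frac{1}{\alpha} \left( U(x_1) + \beta \log \sum_{x_2} p_0(x_2|x_1) \exp\left( \frac{1}{\beta} U(x_2|x_1) \right) \right) \right).
\end{aligned}$$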
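
The two-step solution can be sketched as a backward pass: first solve the inner problem for every $x_1$, then feed $\beta \log Z_2$ into the outer problem. The function names, the data layout (lists indexed by $x_1$ and $x_2$) and the example numbers below are assumptions for illustration. The helper `nested_objective` evaluates the nested expression with constants dropped, so that the optimum can be compared with $\alpha \log Z_1$.

```python
import math


def two_step_policy(p0_x1, p0_x2, u1, u2, alpha, beta):
    """Inner Gibbs solution per x1, then the outer Gibbs solution.

    p0_x1[i] is p0(x1=i), p0_x2[i][j] is p0(x2=j | x1=i),
    u1[i] is U(x1=i) and u2[i][j] is U(x2=j | x1=i).
    """
    inner, log_z2 = [], []
    for prior_row, util_row in zip(p0_x2, u2):
        weights = [p * math.exp(u / beta) for p, u in zip(prior_row, util_row)]
        z2 = sum(weights)
        inner.append([w / z2 for w in weights])
        log_z2.append(math.log(z2))
    weights = [p * math.exp((u + beta * lz) / alpha)
               for p, u, lz in zip(p0_x1, u1, log_z2)]
    z1 = sum(weights)
    outer = [w / z1 for w in weights]
    return outer, inner, alpha * math.log(z1)


def nested_objective(p_x1, p_x2, p0_x1, p0_x2, u1, u2, alpha, beta):
    """Nested expression of the variational problem, constants dropped."""
    total = 0.0
    for i, p1 in enumerate(p_x1):
        if p1 == 0.0:
            continue
        inner = sum(p2 * (u2[i][j] - beta * math.log(p2 / p0_x2[i][j]))
                    for j, p2 in enumerate(p_x2[i]) if p2 > 0.0)
        total += p1 * (u1[i] - alpha * math.log(p1 / p0_x1[i]) + inner)
    return total


# Hypothetical example with two actions and two observations.
P0_X1 = [0.5, 0.5]
P0_X2 = [[0.5, 0.5], [0.9, 0.1]]
U1 = [0.0, 0.0]
U2 = [[1.0, 1.0], [0.0, 4.0]]

policy, expectation, value = two_step_policy(P0_X1, P0_X2, U1, U2, 1.0, 1.0)
```

For positive $\alpha$ and $\beta$ the returned `value` equals the nested objective evaluated at the returned policy, and the default policy `P0_X1`, `P0_X2` scores lower.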
For notational convenience we introduce $\lambda = \frac{1}{\alpha}$ and
$\mu = \frac{1}{\beta}$. Depending on the values of $\lambda$ and $\mu$ we can discern the
following cases:

1. Risk-seeking bounded rational agent: $\lambda > 0$ and $\mu > 0$

When $\lambda > 0$ the agent is bounded and acts in general stochastically. When
$\mu > 0$ the agent considers the move of the environment as if it were its own
move (hence "risk-seeking" due to the overly optimistic view). This follows
immediately from the choice axioms presented in Section 3.1. We can also see
this from the relationship between $Z_1$ and $Z_2$ derived above, if we assume
$\mu = \lambda$ and introduce the value function $V_t = \frac{1}{\lambda} \log Z_t$, which
results in the recursion

$$V_{t-1} = \frac{1}{\lambda} \log \sum_{x_{t-1}} p_0(x_{t-1}|\cdot) \exp\left( \lambda \left( U(x_{t-1}|\cdot) + V_t \right) \right).$$

Similar recursions based on the log-transform have been previously exploited
for efficient approximations of optimal control solutions both in the discrete
and the continuous domain [2,6,14]. In the perfectly rational limit
$\lambda \to +\infty$, this recursion becomes the well-known Bellman recursion

$$V^*_{t-1} = \max_{x_{t-1}} \left( U(x_{t-1}|\cdot) + V^*_t \right)$$

with $V^*_t = \lim_{\lambda \to +\infty} V_t$.

2. Risk-neutral perfectly rational agent: $\lambda \to +\infty$ and $\mu \to 0$

This is the limit for the standard optimal controller. We can see this from
the solution for $p(x_1)$ by noting that

$$\lim_{\mu \to 0} \frac{1}{\mu} \log \sum_{x_2} p_0(x_2|x_1) \exp\left( \mu U(x_2|x_1) \right) = \sum_{x_2} p_0(x_2|x_1) U(x_2|x_1),$$

which is simply the expected utility. By setting $U(x_1) = 0$, and taking the
limit $\lambda \to +\infty$ in the solution for $p(x_1)$, we therefore obtain an expected
utility maximizer

$$p(x_1) = \delta(x_1 - x_1^*)$$

with

$$x_1^* = \arg\max_{x_1} \sum_{x_2} p_0(x_2|x_1) U(x_2|x_1).$$

As discussed previously, action selection becomes deterministic in the
perfectly rational limit.
3. Risk-averse perfectly rational agent: $\lambda \to +\infty$ and $\mu < 0$

When $\mu < 0$ the decision-maker assumes a pessimistic view with respect
to the environment, as if the environment was an adversarial or malevolent
agent. This attitude is sometimes called risk-aversion, because such agents
act particularly cautiously to avoid high uncertainty. We can see this from
the solution for $p(x_1)$ by writing a Taylor series expansion for small $\mu$:

$$\frac{1}{\mu} \log \sum_{x_2} p_0(x_2|x_1) \exp\left( \mu U(x_2|x_1) \right) \approx E[U] - \frac{|\mu|}{2} \mathrm{Var}[U],$$

where higher than second order cumulants have been neglected. The name
risk-sensitivity then stems from the fact that variability or uncertainty in
the utility of the Taylor series is subtracted from the expected utility. This
utility function is typically assumed in risk-sensitive control schemes in the
literature [18], whereas here it falls out naturally. The perfectly rational actor
with risk-sensitivity $\mu$ picks the action

$$p(x_1) = \delta(x_1 - x_1^*)$$

with

$$x_1^* = \arg\max_{x_1} \frac{1}{\mu} \log \sum_{x_2} p_0(x_2|x_1) \exp\left( \mu U(x_2|x_1) \right),$$

which can be derived from the solution for $p(x_1)$ by setting $U(x_1) = 0$ and by
taking the limit $\lambda \to +\infty$. Within the framework proposed in this paper we
might also interpret the equations such that the decision-maker considers the
environment as an adversarial opponent with bounded rationality $\mu$.
4. Robust perfectly rational agent: $\lambda \to +\infty$ and $\mu \to -\infty$

When $\mu \to -\infty$ the decision-maker makes a worst case assumption about
the adversarial environment, namely that it is also perfectly rational. This
leads to the well-known game-theoretic minimax problem with the solution

$$x_1^* = \arg\max_{x_1} \min_{x_2} U(x_2|x_1),$$

which can be derived from the solution for $p(x_1)$ by setting $U(x_1) = 0$, taking
the limits $\lambda \to +\infty$ and $\mu \to -\infty$ and by noting that
$p(x_1) = \delta(x_1 - x_1^*)$. Minimax problems have been used to reformulate robust
control problems that allow controllers to cope with model uncertainties [1].
Robust control problems are also known to be related to risk-sensitive
control [1]. Here we derived both control types from the same variational
principle.
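
The cases listed above differ only in how the environment's move is summarized by the quantity $\frac{1}{\mu} \log \sum_{x_2} p_0(x_2|x_1) \exp(\mu U(x_2|x_1))$. The sketch below evaluates this quantity for different values of $\mu$ and picks the best action in the perfectly rational limit with $U(x_1) = 0$. The name `certainty_equivalent`, the two-action example and the specific values of $\mu$ are illustrative assumptions; the example uses strictly positive probabilities so that the logarithm is always defined.

```python
import math


def certainty_equivalent(probs, utils, mu):
    """(1/mu) * log sum_x p0(x) exp(mu * U(x)), with mu = 1/beta."""
    if mu == 0.0:
        # Limit mu -> 0: the plain expected utility (risk-neutral case).
        return sum(p * u for p, u in zip(probs, utils))
    # Shift by the extreme utility so that all exponents are non-positive.
    shift = max(utils) if mu > 0 else min(utils)
    s = sum(p * math.exp(mu * (u - shift)) for p, u in zip(probs, utils))
    return shift + math.log(s) / mu


def best_action(p0, utils, mu):
    """Index of the action x1 maximizing the certainty equivalent."""
    scores = [certainty_equivalent(p, u, mu) for p, u in zip(p0, utils)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical environment: action 0 is safe, action 1 is a gamble.
P0 = [[0.5, 0.5], [0.5, 0.5]]    # p0(x2 | x1)
UTILS = [[1.0, 1.0], [0.0, 4.0]]  # U(x2 | x1)

risk_neutral = best_action(P0, UTILS, 0.0)    # expected utility maximizer
robust = best_action(P0, UTILS, -50.0)        # approaches the minimax choice
risk_seeking = best_action(P0, UTILS, 50.0)   # optimistic about the environment
```

In this example the risk-neutral and risk-seeking agents choose the gamble, while the strongly negative $\mu$ selects the safe action, as the worst-case outcome of the gamble dominates its score. For small negative $\mu$ the score of the gamble is close to its mean minus $\frac{|\mu|}{2}$ times its variance.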

5 Conclusion 

In this paper we have proposed a thermodynamic interpretation of bounded
rationality based on a free utility principle. Accordingly, bounded rational agents
trade off utility maximization against resource costs measured by the KL
divergence with respect to an initial policy. The use of the KL divergence as a
cost function for control has been previously proposed to measure deviations
from passive dynamics in Markov systems [13,14]. Other methods of statistical
physics have been previously proposed as an information-theoretic approach
to interactive learning [12] and to game theory with bounded rational players
[19]. The contribution of our study is to devise a single axiomatic framework
that allows for the treatment of control problems, game-theoretic problems and
estimation and learning problems for perfectly rational and bounded rational
agents. In the future it will be interesting to relate the thermodynamic resource
costs of bounded rational agents to more traditional notions of resource costs
in computer science like space and time requirements when computing optimal
actions [16].